Improve response parsing: no ptypes, faster datetimes #513

Open
karawoo wants to merge 10 commits into main from kara-parsing

Conversation

Collaborator

@karawoo karawoo commented Mar 5, 2026

Intent

Fixes #483

Approach

Removes the connectapi_ptypes dictionary and vctrs-based type coercion system (ensure_columns, ensure_column, vec_cast). Instead, each getter function declares its own datetime_cols and applies lightweight post-parse coercion via coerce_datetime(), which handles RFC 3339 strings, epoch seconds, POSIXct pass-through, and all-NA columns.
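
As a sketch of the coercion described above (hypothetical; the branch's actual coerce_datetime() may differ in name, signature, and edge-case handling):

```r
# Hypothetical sketch of per-column datetime coercion; not the branch's
# actual implementation.
coerce_datetime_sketch <- function(x) {
  if (inherits(x, "POSIXct")) {
    return(x) # already parsed: pass through unchanged
  }
  if (all(is.na(x))) {
    # An all-NA column carries no type information; return NA datetimes
    return(.POSIXct(rep(NA_real_, length(x)), tz = "UTC"))
  }
  if (is.numeric(x)) {
    # Epoch seconds, e.g. from older endpoints
    return(as.POSIXct(x, origin = "1970-01-01", tz = "UTC"))
  }
  # RFC 3339 strings such as "2026-03-05T20:13:00Z"
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
}
```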

Parsing pipeline:

  • Connect$request() now uses jsonlite::fromJSON() instead of httr::content(as = "parsed"), giving us control over jsonlite's simplification behavior
  • parse_connect_rfc3339() uses a vectorized substr-based parser (faster than strptime on large vectors)
  • page_cursor() fetches subsequent pages with simplify=TRUE so jsonlite builds data frames in C, then combines with vctrs::vec_rbind()
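
To illustrate the substr-based idea, here is a minimal sketch (not the branch's parse_connect_rfc3339()): it assumes fixed-width UTC timestamps like "2026-03-05T20:13:00Z" and ignores offsets and fractional seconds, computing the epoch arithmetically instead of calling strptime per element.

```r
# Hypothetical vectorized RFC 3339 parser sketch; the real
# parse_connect_rfc3339() handles more input shapes.
parse_rfc3339_utc <- function(x) {
  y  <- as.integer(substr(x, 1, 4))
  m  <- as.integer(substr(x, 6, 7))
  d  <- as.integer(substr(x, 9, 10))
  hh <- as.integer(substr(x, 12, 13))
  mm <- as.integer(substr(x, 15, 16))
  ss <- as.integer(substr(x, 18, 19))
  # Days since 1970-01-01 via the "days from civil" algorithm
  y2  <- y - (m <= 2)
  era <- y2 %/% 400
  yoe <- y2 - era * 400
  doy <- (153L * (m + ifelse(m > 2, -3L, 9L)) + 2L) %/% 5L + d - 1L
  doe <- yoe * 365L + yoe %/% 4L - yoe %/% 100L + doy
  days <- era * 146097L + doe - 719468L
  .POSIXct(days * 86400 + hh * 3600 + mm * 60 + ss, tz = "UTC")
}
```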

I started this work while trying to improve performance of get_usage_static() (see profiling on #501 (comment)). With this branch, get_usage_static() is ~20% faster than on main for 175k records.

Checklist

  • Does this change update NEWS.md (referencing the connected issue if necessary)?
  • Does this change need documentation? Have you run devtools::document()?

@karawoo karawoo requested a review from jonkeane March 5, 2026 20:13
@karawoo karawoo mentioned this pull request Mar 5, 2026
Comment on lines +860 to +862
# Keep only the columns relevant to job termination; the API response
# includes extra fields (e.g. payload, guid) on error that vary by outcome.
keep <- c("app_id", "app_guid", "job_key", "job_id", "result", "code", "error")
Collaborator

Is it a problem that we get variable data at this point?

Collaborator Author

I believe that the variable data is for error cases (i.e. trying to terminate a job that is not currently active, which returns a 409 but doesn't raise an R error) vs. successful requests. I guess it's debatable whether we should be raising an R error more eagerly, but I think the behavior of accommodating different field names is consistent with main.

The one difference here is that main will always return the columns of keep, whereas on this branch, if any columns from keep are missing across all responses, they'd be omitted. I'll update to make it more consistent with main.

Collaborator

I was actually thinking a bit in the opposite direction: if we sometimes get different field names, that's probably totally OK. This is probably more important (or really, more possible) when we get to having these objects not be DFs that get passed around; then the normalization of "here are the columns you're getting" can be done at the as.data.frame() point. We don't need to do this here, it just smelled a little funny to me.

Collaborator Author

Gotcha yeah, if we were returning a list then I agree it'd make more sense to not worry about the columns -- the responses are what they are, and that may include different fields.

Comment on lines +796 to +799
# The older /applications/ endpoint returns timestamps as Unix epoch integers
# and ID fields as integers. Normalize to match the v1 endpoint's types.
# For the v1 endpoint these are already character/POSIXct, so the coercions
# are no-ops.
Collaborator

It's probably not worth too much digging, but if you know off the top of your head: how old are older versions here?

Collaborator Author

GET /v1/content/{guid}/jobs was added in October 2022 (https://docs.posit.co/connect/news/#rstudio-connect-2022.10.0); the old applications endpoint was removed in July 2025.


- parse_connectapi_typed(bundles, connectapi_ptypes$bundles)
+ out <- parse_connectapi_typed(bundles, datetime_cols = "created_time")
+ coerce_fs_bytes(out, "size")
Collaborator

We can keep this here, but I do wonder if providing these as fs::byte types is really useful?

Comment on lines +68 to 74
# No shiny apps are deployed in integration tests, so this may be empty.
if (nrow(shiny_usage_local) > 0) {
  expect_gt(length(colnames(shiny_usage_local)), 1)
expect_true("content_guid" %in% names(shiny_usage_local))
expect_s3_class(shiny_usage_local$started, "POSIXct")
}
})
Collaborator

I assume the if here is because a 0-row df doesn't get coerced to this type? But is that true / is that what we want?

Collaborator Author

If there are no shiny apps, the usage data returns an empty response with no rows or columns, so we can't assert the column names/types.
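
The underlying behavior can be reproduced directly with jsonlite (a minimal illustration, not the package's actual call path):

```r
library(jsonlite)

# With records present, jsonlite can infer column names and types and
# simplify the result into a data frame:
str(fromJSON('[{"content_guid": "abc", "started": "2026-03-05T20:13:00Z"}]'))

# With an empty response there is nothing to infer column names or types
# from, so the simplified result has no rows or columns:
fromJSON('[]')
```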

Collaborator

Ah, hmm, that seems not ideal, but if it's not a regression in this work, that's fine. But maybe a follow-up would be good?

Collaborator Author

It is a regression in this work. Previously we would have been able to construct a 0-row data frame with the expected columns based on the ptypes. Since we want to remove the ptypes, we now have no way to know what columns to create. The regression here is inherent to the motivation of the PR. I'm not sure what follow-up we could do that wouldn't bring back ptypes in some form.

Collaborator

Aaah ok, I get it now: the response from the server is a different shape if there are no visits to shiny apps to report, yeah?

I did a bit of poking and it looks like that is also common across other endpoints too (e.g. content search returns an empty list when there are no matches). So that will have a similar issue too.

Comment on lines +412 to +415
- id = c("8966707", "8966708", "8967206", "8967210", "8966214"),
+ id = c(8966707L, 8966708L, 8967206L, 8967210L, 8966214L),
Collaborator

Are these supposed to remain characters? I believe they should (given how we do our IDs), but maybe I'm missing something?

Collaborator Author

They changed because we now return whatever the API returns rather than trying to coerce to specific types, and the test fixtures have integers rather than characters. I'll update those, but the behavior here is correct; it only seems confusing because these are unit tests that can get out of sync with Connect's actual response types.

Collaborator

@jonkeane jonkeane left a comment

Sorry, I commented outside of a review so that I didn't accidentally leave comments without sending them. Overall this looks good, and I'm pro getting this onto main and using it locally to confirm it works well.

Do you think it would be a good thing to add a test or two that simulates the changes we had to make in #512?


Development

Successfully merging this pull request may close these issues.

Remove or drastically overhaul type parsing even when producing data frames
